Abstract

Distributed file systems that scale by partitioning files and directories among a collection of servers inevitably encounter cross-server
operations. A common example is a RENAME that moves a file from a directory managed by one server to a directory
managed  by  another.  Systems  that  provide  the  same  semantics  for  cross-server  operations  as  for  those  that  do  not  span  servers 
traditionally  implement  dedicated  protocols  for  these  rare  operations.  This  paper  suggests  an  alternate  approach  that  exploits 
the  existence  of  dynamic  redistribution  functionality  (e.g.,  for  load  balancing,  incorporation  of  new  servers,  and  so  on).  When  a 
client  request  would  involve  files  on  multiple  servers,  the  system  can  redistribute  those  files  onto  one  server  and  have  it  service  the 
request.  Although  such  redistribution  is  more  expensive  than  a  dedicated  cross-server  protocol,  the  rareness  of  such  operations 
makes  the  overall  performance  impact  minimal.  Analysis  of  NFS  traces  indicates  that  cross-server  operations  make  up  fewer  than 
0.001%  of  client  requests,  and  experiments  with  a  prototype  implementation  show  that  the  performance  impact  is  negligible  when 
such operations make up as much as 0.01% of operations. Thus, when dynamic redistribution functionality exists in the system,
cross-server  operations  can  be  handled  with  little  additional  implementation  complexity. 
1 Introduction


Distributed file services should be transparently scalable. That is, it should be possible to increase both
capacity and throughput by adding servers and spreading data and work among them. Further, users and
client applications should not have to be aware of which servers host which data; the visible semantics
should be consistent. This design goal has been demonstrated by several systems (e.g., [2, 3, 28]).

Most designs scale by partitioning the set of files across the set of servers such that each file is managed
by one server. Most operations access a single file, and accesses to files managed by distinct servers can
proceed in parallel.¹ But a few operations require access to more than one file and, occasionally, the files
will be managed by distinct servers. For example, a RENAME operation can move a file from one directory
to another, thus involving both directories. Some systems don’t guarantee the same semantics for operations
on multiple files that are managed by distinct servers, but “transparent” scalability dictates that a distributed
file system should. Continuing with the example, many applications rely on RENAME operations to provide
specific consistency semantics (e.g., atomicity).

The  common  approach  to  upholding  the  semantics  uses  additional  protocols  to  complete  cross-server 
operations.  Doing  so  provides  the  expected  semantics,  but  it  introduces  significant  complexity  to  support  a 
very  uncommon  case — for  example,  cross-directory  renames  represent  less  than  0.001%  of  operations  in 
analyzed  NFS  traces,  and  little  of  that  0.001%  would  involve  more  than  one  server  in  most  deployments. 
This  rareness  also  means  that  cross-server  operations  don’t  naturally  show  up  in  benchmark-based  testing, 
aggravating  the  cost  of  the  added  complexity  (by  requiring  crafted  test  scenarios)  or  resulting  in  robustness 
holes. 

This  paper  promotes  an  alternate  approach:  avoid  cross-server  operations  altogether.  When  a  client 
request would involve files managed by more than one server, the system redistributes management
responsibilities such that the affected files are managed by the same server. The ability to redistribute files is an
important  part  of  balancing  load  among  servers,  allowing  a  system  to  address  hot  spots,  capacity  exhaustion, 
and  server  addition  or  removal.  If  it  is  present  for  these  other  reasons,  dynamic  redistribution  can  be  used  to 
eliminate  cross-server  operations  for  “free,”  avoiding  the  need  for  protocols  to  support  them. 

This  paper  describes  a  prototype  metadata  service  that  supports  dynamic  redistribution  and  uses  it  to 
avoid  cross-server  operations.  Each  metadata  server  manages  an  independent  set  of  files  and  stores  its  state 
as  database  tables  in  shared  storage-nodes.  Moving  management  of  some  files  from  one  server  to  another 
involves  two  steps:  checkpointing  the  relevant  database  table(s)  and  passing  ownership  of  them  to  the  new 
server.  When  the  system  needs  to  move  management  of  only  a  subset  of  the  files,  splitting  a  table  can 
reduce  the  overall  impact.  But,  moving  entire  tables  will  work,  and  of  course  they  can  be  moved  back,  if 
appropriate,  after  the  movement-inducing  operation  is  complete.  Clients  contact  the  server  most  recently 
known to manage a file and are redirected if necessary.

Experiments  with  the  system  and  trace  analyses  illustrate  the  efficacy  of  using  dynamic  redistribution 
to  eliminate  cross-server  operations.  As  expected,  we  observe  linear  throughput  increases  for  balanced 
workloads  when  increasing  the  number  of  servers.  Also  as  expected,  operations  that  require  redistribution  to 
avoid  being  cross-server  are  significantly  slower  (e.g.,  VlOOqs  vs.  lOOOqs  for  RENAME)  than  like  operations 
that do not. Analysis of long-term NFS traces, however, shows that such operations are very rare: fewer
than 0.001% of operations could be cross-server (e.g., cross-directory renames), and their locality makes
most of them unlikely to be cross-server in practice (e.g., because they rename a file to the immediate parent
directory).  Thus,  although  cross-server  operations  are  much  slower,  the  impact  on  overall  performance  is 
minimal.  Balanced  against  the  simplicity  of  avoiding  them,  we  believe  this  approach  is  right  for  many 
distributed  file  systems  and  perhaps  other  distributed  systems  as  well. 

¹ We talk in terms of one server per file, for clarity. Each “server” in this discussion could be a fault-tolerant group of replicas
(i.e., a replicated state machine [26]), as in Farsite [2] for example, with no change.
The remainder of this paper is organized as follows. Section 2 reviews scalable file services, cross-server
operations, and related work. Section 3 discusses design issues for dynamic redistribution support
and its use in avoiding cross-server operations. Section 4 describes our implementation of such support in
a prototype distributed storage system. Section 5 evaluates this approach to handling potential cross-server
operations with experiments and trace analyses.

2  Background  and  related  work 

This section discusses scalability and transparency in distributed file systems, cross-server operations, and
approaches to supporting them.

2.1  Scalability  in  distributed  file  systems 

There have been many distributed file systems proposed and implemented over the years. To provide context
for this work, we categorize them into three groups based on what they can scale transparently. We use
“scaling” to refer to increasing storage capacity and throughput by adding more servers, as contrasted with using
techniques like client-side caching to increase the number of clients that a given server can support [16].
Thus, scaling happens by increasing the number of servers and partitioning data and responsibilities across
them. “Transparent” scaling implies scaling without client applications having to be aware of how data is
spread across servers; a distributed file system is not transparently scalable if client applications must be
aware of capacity exhaustion of a single server or different semantics depending on which server(s) hold
accessed files.

No transparent scalability: Many distributed file systems, including those most widely deployed, do
not scale transparently. NFS, CIFS, and AFS all have the property that file servers can be added but that
each serves independent file systems (or volumes, in the case of AFS). A client can mount file systems from
multiple file servers, but must cope with each server’s limited capacity and the fact that certain operations
(e.g., RENAME) are not atomic across servers.

Our  focus,  in  this  paper,  is  on  distributed  file  systems  that  provide  fairly  strong  consistency  semantics. 
But file systems that provide weaker consistency, such as eventual consistency with post hoc conflict detection
and application/user-assisted resolution (e.g., Bayou [27], Pangaea [24], and Ivy [20]), could also go
into  this  “no  transparent  scalability”  category. 

Transparent data scalability: An increasingly popular design principle is to separate metadata management
(e.g., directories, quotas, data locations) from data storage [3, 13, 17, 28]. The latter can be transparently
scaled relatively easily, assuming all multi-object operations are handled by the metadata server(s),
since  each  data  access  is  independent  of  the  others.  Clients  interact  with  the  metadata  server(s)  for  metadata 
activity  and  to  discover  the  locations  of  data.  They  then  access  data  directly  at  the  appropriate  data  server(s). 
Metadata  semantics  and  policy  management  stay  with  the  metadata  server(s),  permitting  simple  centralized 
solutions.  The  metadata  server(s)  can  limit  throughput,  of  course,  but  offloading  data  accesses  pushes  the 
overall  system’s  limit  much  higher  [14].  Scaling  of  the  metadata  service  is  needed  to  go  beyond  this  limit, 
and many existing systems do not offer a transparent metadata scaling solution (e.g., Lustre [18] and most
SAN  file  systems). 

Transparent  scalability:  A  few  distributed  file  systems  offer  full  transparent  scalability,  including 
Farsite  [2],  GPFS  [25],  and  Frangipani  [28].  Most  use  the  data  scaling  architecture  above,  separating  data 
storage  from  metadata  management.  Then,  they  add  protocols  for  handling  metadata  operations  that  span 
metadata  servers.  Section  2.3  discusses  these  further. 
2.2 Multi-file operations

There are a variety of operations that manipulate multiple files, thus creating a consistency challenge when
the files are not all on the same server. Of course, every file create and delete involves two files: the parent
directory and the file being created or deleted. But most systems assign a file to the server that owns its
parent directory. At some points in the namespace, of course, a directory must be assigned somewhere other
than the home of its parent, or else all metadata will be managed by a single metadata server. So, the create
and delete of that directory will involve more than one server, but none of the other operations on it will
do so. This section describes three more significant sources of multi-file operations: non-trivial namespace
manipulations, transactions, and snapshots.

Namespace manipulations: The most commonly noted multi-file operation is RENAME, which changes
the name of a file. The new name can be in a different directory, which would make the RENAME operation
involve both the source and destination parent directories. Also, a RENAME operation can involve additional
files if the destination exists (and, thus, should be deleted) or if the file being renamed is a directory (in which
case, the “..” entry must be modified). Application programming is simplest when the RENAME operation
is atomic, and both the POSIX and the NFSv3 specifications call for atomicity.² Many applications assume
atomic RENAME, or at least that the destination will be its before or after version, as a building block. For
example, many document editing programs implement atomic updates by writing the new document version
into a temporary file and then using RENAME to move it to the user-assigned name. Without atomicity,
applications and users can see strange intermediate states, such as two identical files (one with each name)
existing or one file with both names as hard links.
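
The write-then-rename idiom used by document editors can be sketched in Python. This is a minimal illustration, assuming a local POSIX file system on which `os.replace` is atomic; the helper name `atomic_write` and the temporary-file prefix are our own choices, not taken from any system discussed here.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the contents of `path` so readers see the old or the new version."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the destination so that the final
    # rename stays within one directory (and therefore one file system).
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make the new version durable first
        os.replace(tmp_path, path)  # the RENAME step; atomic on POSIX
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the rename were not atomic, a reader could observe the destination name missing or bound to a partially written file, which is exactly the intermediate state applications assume cannot occur.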

Creation and deletion of hard links are also multi-file operations. Links created in the same directory
rarely result in cross-server operations. But, for links created in other directories, it is possible for the two
directories with names for a file to be on different servers. Thus, the create and subsequent delete of that
file would involve both servers. The same can happen with a RENAME that moves a file into a directory
managed by a different server than the directory in which it was originally created.

Transactions: Transactions are a very useful building block. Modern file systems, such as NTFS [21]
and Reiser4 [23], are adding support for multi-request transactions. So, for example, an application could
update a set of files atomically, rather than one at a time, and thereby preclude others seeing intermediate
forms of the set. This is particularly useful for program installation and upgrade.

Snapshot: Point-in-time snapshots [5, 15, 19, 22] have become a mandatory feature of most storage
systems, as a tool for on-line and consistent off-line back-ups. Snapshots also offer a building block for
on-line integrity checking [19] and remote mirroring of data [22]. Snapshot is usually supported only for
entire file system volumes, but some systems allow snapshot of particular subtrees of the directory hierarchy.
In any case, it is clearly a substantial multi-file operation, with the expectation that the snapshot captures all
covered files at a single point in time.

2.3  Handling  cross-server  operations 

As discussed in Section 2.1, few distributed file systems support cross-server operations transparently. Those
that do use one of two approaches: client-driven protocols and inter-server protocols.

In the first approach, illustrated by GPFS [25] and Frangipani [28], the client requesting the cross-server
operation implements it. This approach is generally used for shared disk (or logical disk) systems.
The client acquires locks on the relevant files and updates them in the shared disk system, using write-ahead
logging into shared journals to update them atomically. This approach offloads work from servers, which
is generally a good thing, but also trusts clients to participate correctly in metadata updates and introduces
interesting failure scenarios when clients fail.

² Each specification indicates one or more corner cases where atomicity is not necessarily required. For example, POSIX requires
that, if the destination currently exists, the destination name must continue to exist and point to either the old file or the new file
throughout the operation.

In the second approach, illustrated by Farsite [2],³ operations that span servers are handled by an inter-server
protocol. The protocol usually consists of some form of two-phase commit, with a lead server acting
as the initiator. Because this second approach does not require modification of the client-server protocol, it
is the one most often identified as a “planned extension” for systems that do not yet scale to more than one
metadata server (e.g., [1, 18]). Such extensions often stay “planned” for a long time, however, because of
complexity and performance overhead concerns (e.g., [7, 8]).

Because cross-server operations are rare, protocols implemented just to support them usually do not
receive the same attention as the more common cases. They tend to be exercised and tested less thoroughly,
becoming a breeding ground for long-lived bugs that will arise when this code is finally relied upon. We
promote an approach based on eliminating cross-server operations, via heavy-weight redistribution that
serves multiple functions, rather than supporting them with their own protocols. Doing so can simplify
scalable distributed file systems with minimal impact on performance, so long as cross-server operations are
truly rare.

3  Dynamic  redistribution 

The ability to dynamically redistribute responsibility for storing and managing files is a powerful building
block of scalable file systems. Dynamic redistribution allows a system to balance load when hot spots form
or one server’s capacity is exhausted. It also allows the system to reconfigure responsibilities when servers
are added or removed, so as to appropriately utilize available resources. This section reviews how dynamic
redistribution can be realized and describes how it can be used to eliminate cross-server operations.

3.1  Supporting  dynamic  redistribution 

Dynamic redistribution has been a part of many previous system designs. For example, AFS [16] allows
volumes to be migrated freely among servers in an AFS cell. xFS’s design [3], which splits metadata management
from data storage, allows for migration of each (for subsets of data) among participating systems.

Although not trivial, dynamic redistribution is not overly complex. It involves four primary components:
a level of indirection between IDs and locations, a mechanism for changing that mapping, a mechanism
for moving the content from previous “owner” to new, and a mechanism for deciding what should be
where.

Mapping objects to servers: A level of indirection is a common tool for gaining flexibility in mapping
virtual identities to physical resources. Dynamic redistribution exploits such a level of indirection, between
a data or metadata ID and the server that manages that object, which can be provided by a mapping server
(e.g., the volume location database in AFS) or a read-shared table (e.g., the manager map in xFS).

Changing the mapping: When the mapping is changed, relevant clients and servers must know about
it. The previous and current servers must know, since the former should stop servicing requests and the
latter should start. For the servers involved, at least, there must be agreement on who “owns” a given object
at every point in time. Also, clients obviously must be able to learn the latest server for a given object in
order to be able to access it.

³ Recall that “server” in our discussion could be replaced with “replicated state machine,” with no other changes. Farsite uses
BFT-based [4] Byzantine fault-tolerant state machine replication to form each “directory group” that manages a subset of the
namespace. For Farsite, “cross-server operations” means “cross-directory group operations.”

A change to the mapping usually includes two parts: updating the mapping and rerouting clients with
out-of-date mapping information. The mapping can be updated locally, in the mapping server case, or
updated and pushed out in the read-shared table case. Clients can easily end up going to a previous server
with a request (for example, if a client requested the mapping just before the change and sent the request just
after), so servers must reroute such client requests to the appropriate server. They can do so by informing
the client of the change or forwarding the request directly.
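
The rerouting of clients that hold out-of-date mapping information can be illustrated with a small in-memory model. This is only a sketch under simplifying assumptions: the classes `MappingTable`, `Server`, and `Client`, and the `REDIRECT` reply, are hypothetical names introduced here, and the model shows the "inform the client" option rather than request forwarding.

```python
from dataclasses import dataclass, field


@dataclass
class MappingTable:
    """Hypothetical map from object ID to the name of its owning server."""
    owner: dict = field(default_factory=dict)


@dataclass
class Server:
    name: str
    mapping: MappingTable

    def handle(self, object_id: str, request: str):
        current = self.mapping.owner[object_id]
        if current != self.name:
            # The client used stale information: tell it who owns the object now.
            return ("REDIRECT", current)
        return ("OK", f"{self.name} served {request} on {object_id}")


class Client:
    def __init__(self, servers: dict, mapping: MappingTable):
        self.servers = servers
        self.cache = dict(mapping.owner)  # snapshot of the mapping; may go stale

    def send(self, object_id: str, request: str):
        target = self.cache[object_id]
        status, payload = self.servers[target].handle(object_id, request)
        if status == "REDIRECT":
            self.cache[object_id] = payload  # learn the new owner, retry once
            status, payload = self.servers[payload].handle(object_id, request)
        return status, payload
```

A production client would loop on redirects, since ownership may change again between the redirect and the retry; the single retry keeps the sketch short.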

Moving the content: In addition to shifting responsibility, the actual content being served needs to be
put under the control of the new server. There are two cases to consider. First, when the object metadata
being redistributed is stored in a storage infrastructure shared by the servers (e.g., in a SAN to which they
are all attached), it is sufficient for the previous server to checkpoint any relevant dirty cache state and
stop accessing it. The new server can then take over responsibility for that metadata and access it directly
on the same storage device. This “shared storage” architecture is increasingly common in scalable file
systems [2, 3, 28]. Second, when objects are stored on each server’s locally-attached storage devices, they
must actually be copied from one server to another. Clearly, such copying is much more expensive than the
first case, and thus redistribution in such an environment must be used more carefully.

While moving the contents from one server to another, most systems will block write requests for those
contents. In such systems, moving the contents usually cannot be a background activity, and redistribution
can create a noticeable hiccup in performance.

Coordinating changes: A system that supports dynamic redistribution must have policies for deciding
which changes to make and when to make them, as well as a mechanism for invoking and coordinating
changes. Redistribution is usually used to improve performance or address a capacity constraint. At the same
time, it comes with overheads (e.g., locking and data copying). Thus, decisions to redistribute should balance
these issues. In many systems, redistribution is driven by explicit administrator tools, and approaches for
automated decision-making continue to be refined.

Changes that are desired must be coordinated to handle possibly concurrent redistributions. One approach
is to have a central “coordination server” that decides what should be where and enacts one change
at a time to move towards that state. Alternately, a system can use agreement protocols among the relevant
servers to decide on a change to the mapping table and then enact it. Each change involves locking the
relevant objects, moving them, changing the mapping information, and unlocking the objects. Changes for
independent sets of objects can certainly proceed in parallel, though the rate of changes should not be high
enough to make such concurrency important.
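
The four steps of a change (lock, move, update the mapping, unlock) can be sketched for the central-coordinator variant. The `Coordinator` class below is a hypothetical in-memory model written for illustration; it serializes changes with a single lock, matching the "one change at a time" policy, and does not model failures or agreement protocols.

```python
import threading


class Coordinator:
    """Hypothetical central coordinator that enacts one change at a time."""

    def __init__(self, mapping: dict, stores: dict):
        self.mapping = mapping  # object ID -> owning server name
        self.stores = stores    # server name -> {object ID: metadata}
        self.locked = set()     # objects currently being moved
        self._one_at_a_time = threading.Lock()

    def move(self, object_ids, dest: str) -> None:
        with self._one_at_a_time:
            self.locked.update(object_ids)  # 1. lock the relevant objects
            try:
                for oid in object_ids:
                    src = self.mapping[oid]
                    if src == dest:
                        continue  # already in the right place
                    # 2. move the content to the new owner
                    self.stores[dest][oid] = self.stores[src].pop(oid)
                    # 3. change the mapping information
                    self.mapping[oid] = dest
            finally:
                # 4. unlock the objects, even if a step above failed
                self.locked.difference_update(object_ids)
```

Servers in this model would consult `locked` and delay or reject writes to objects in flight, which corresponds to the blocking of write requests during a move.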

3.2  Eliminating  cross-server  operations 

In addition to load and capacity balancing, dynamic redistribution can also be used to eliminate cross-server
operations; Figure 1 illustrates our method for accomplishing this goal. The concept is straightforward:
when a multi-object request arrives that requires access to objects “owned” by distinct servers, the system
can redistribute those objects and then service the request on the server that owns them all. If desired, the
objects can be redistributed to their pre-request servers immediately after the request is completed, or they
can remain on the new server until the load-balancing policy dictates that they be moved. Although such
dynamic redistribution may be significantly less efficient than an explicit cross-server update protocol, the
impact on overall performance will be small if cross-server operations are rare.
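
The claim that rare but expensive operations barely affect overall performance follows from a simple weighted average. The sketch below makes the arithmetic explicit; the function names are ours, and the latencies used in the usage note are purely hypothetical inputs, not measurements from this paper.

```python
def mean_latency(p_cross: float, t_normal: float, t_redistribute: float) -> float:
    """Average per-operation latency when a fraction p_cross needs redistribution."""
    return (1.0 - p_cross) * t_normal + p_cross * t_redistribute


def slowdown(p_cross: float, t_normal: float, t_redistribute: float) -> float:
    """Relative increase in mean latency compared with no cross-server operations."""
    return mean_latency(p_cross, t_normal, t_redistribute) / t_normal - 1.0
```

For instance, if a redistributing operation were assumed to cost 100 times a normal one, `slowdown(1e-5, 1.0, 100.0)` (a 0.001% cross-server fraction) is roughly 0.1%, and `slowdown(1e-4, 1.0, 100.0)` (0.01%) is roughly 1%. The slowdown equals `p_cross * (t_redistribute / t_normal - 1)`, so it stays small as long as the product of rarity and relative cost stays small.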

Within this basic concept, there are many options. For example, the most efficient approach to handling
a single multi-object request is to redistribute only the objects involved, particularly if the server state
is not kept in shared storage. But limitations of the data structures used for storing metadata and mapping
locations may not allow redistribution of just a few objects; they may dictate that large collections
be redistributed together, particularly if they were designed with only load balancing in mind. Perhaps the
simplest approach is to collapse all objects owned by two servers into one, splitting them again afterwards if
imbalance concerns dictate doing so. If the need for redistribution is infrequent, these choices can be made
largely based on implementation simplicity and conformance to other uses of the dynamic redistribution
mechanism.

Figure 1: Design for eliminating cross-server operations. The sequence of operations required to handle rename a to b is
shown. Returning to the original state is similar.
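
The overall idea of turning a cross-server RENAME into a single-server one can be sketched with a toy model. Everything below is an illustrative assumption: `MetadataCluster` is a hypothetical class, ownership is tracked per directory, and "redistribution" is reduced to reassigning an owner, whereas a real system would also checkpoint and transfer state.

```python
class MetadataCluster:
    """Toy model: each server owns directories; files live with their parent."""

    def __init__(self, owner_of_dir: dict):
        self.owner_of_dir = dict(owner_of_dir)           # directory -> server
        self.entries = {d: set() for d in owner_of_dir}  # directory -> file names

    def _local_rename(self, src_dir, name, dst_dir, new_name):
        # Standard single-server code path: both directories share an owner.
        assert self.owner_of_dir[src_dir] == self.owner_of_dir[dst_dir]
        self.entries[src_dir].remove(name)
        self.entries[dst_dir].add(new_name)

    def rename(self, src_dir, name, dst_dir, new_name, move_back=True):
        src_owner = self.owner_of_dir[src_dir]
        dst_owner = self.owner_of_dir[dst_dir]
        crossed = src_owner != dst_owner
        if crossed:
            # Redistribute: hand the source directory to the destination's server.
            self.owner_of_dir[src_dir] = dst_owner
        self._local_rename(src_dir, name, dst_dir, new_name)
        if crossed and move_back:
            # Optionally restore the pre-request placement.
            self.owner_of_dir[src_dir] = src_owner
```

The `move_back` flag reflects the two policies described in the text: return objects to their pre-request servers immediately, or leave them in place until the load-balancing policy decides otherwise.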

4  Implementation 

We  implemented  dynamic  redistribution  and  cross-server  RENAME  on  our  prototype  distributed  storage 
system, Ursa Minor. A high-level overview of our implementation is provided in Section 4.1 and the details
are described in Section 4.2. Further information about Ursa Minor can be found in Abd-El-Malek et al. [1].

4.1  Overview 

Ursa  Minor’s  architecture  is  based  on  the  NASD/OSD  storage  model  [13,  29]  (see  Figure  2).  In  this  model, 
clients  query  a  metadata  server  to  map  object  IDs  to  the  storage  nodes  in  which  that  object’s  data  is  stored. 
The  metadata  server  stores  its  metadata  in  tables,  which  are  stored  as  objects  in  the  storage  node  cluster. 
It  accesses  these  objects  as  any  other  client  would.  Under  high  load,  the  centralized  metadata  server  can 
become  a  bottleneck.  Though  it  could  be  replicated  for  fault  tolerance,  our  metadata  server  is  not;  however, 
it does provide a level of failure recovery by synchronously writing metadata operations through to the
underlying  storage  nodes.  A  crashed  metadata  server  can  be  recovered  by  pointing  a  new  metadata  server  to 
the  crashed  server’s  metadata  table.  This  server  is  implemented  in  about  20,000  lines  of  code  (see  Table  1). 

Implementation  of  dynamic  redistribution  involved  modifying  a  centralized  metadata  server  capable  of 
crash  recovery.  Additionally,  a  special  root  metadata  server  was  created;  in  our  implementation,  this  server 
is  responsible  for  coordinating  redistribution.  Since  the  granularity  of  redistribution  is  a  single  table  in  our 
implementation,  the  metadata  servers  distribute  object  metadata  over  multiple  tables  in  order  to  minimize 
the number of objects that must be moved for a single cross-server operation. To dynamically redistribute
a  portion  of  a  metadata  server’s  load,  the  associated  table  is  “crashed”  at  the  source  and  recovered  at  the 
destination.  The  root  metadata  server  coordinates  the  distribution  process.  Approximately  3000  lines  of 
code  were  modified  to  support  multiple  tables  and  the  special  features  of  the  root  metadata  server. 

Once  dynamic  load  redistribution  was  in  place,  adding  cross-server  operations  required  only  300  lines 
of code. Cross-server operations are built entirely on top of the dynamic redistribution mechanism by redistributing
management of all involved objects to a single server and then executing the standard command
code  path.  This  is  described  in  more  detail  in  Section  4.2.2. 

Our  implementation  is  detailed  below  to  show  how  little  complexity  is  involved  in  our  solution  and  to 
demonstrate  how  a  legacy  system  can  be  retrofitted  for  dynamic  redistribution  and  cross-server  operations. 
The  primary  design  dictum  was  to  do  whatever  is  simplest  with  little  consideration  for  performance.  Our 
experimental  results  show  that  an  implementation  need  not  be  elegant  or  efficient  to  scale  well. 
Figure 2: Direct client access. First, a client interacts with the metadata server to obtain mapping information (e.g., which
object  IDs  to  access  and  on  which  storage  devices)  and  capabilities  (i.e.,  evidence  of  access  rights).  Second,  the  client  interacts 
with  the  appropriate  storage  device(s)  to  read  and  write  data,  providing  a  capability  with  each  request. 


Component                 Lines of Code
Metadata Server           19694
Dynamic redistribution    2959
Cross-Server Rename       300

Table 1: Complexity in terms of lines of code. If facilities for dynamic redistribution are in place, adding cross-server rename
is relatively simple.


4.2  Implementation  in  Ursa  Minor 

In Ursa Minor, clients access data on storage nodes directly once they have retrieved the data location, encoding,
and permissions from the metadata server. Clients access storage nodes using the PASIS read/write
protocols  [30]  which  provide  fault  tolerance  and  consistency  for  block  reads  and  writes.  Storage  nodes  use 
non-volatile  RAM  to  allow  writes  to  return  immediately  as  long  as  there  is  room  in  the  write-back  buffer 
cache. 

The metadata server is responsible for granting permissions and storing metadata safely. Because metadata
is small and RAM is plentiful, most read-only operations, such as lookup, are expected to hit in cache
and return almost immediately. Operations that require modifications, however, are far more expensive. To
durably store an operation that creates metadata (e.g., CREATE) or an operation that changes metadata (e.g.,
RENAME or a change of object size), our metadata server writes the modified metadata to a storage node
synchronously.  This  limitation  is  aggravated  because  our  naive  implementation  limits  update  concurrency. 

Metadata servers store metadata records in a B-tree table. The B-tree is updated atomically using
the following simple shadow paging scheme. Each page contains a list of all other pages updated by the
transaction that wrote it, the transaction ID, and the new and previous versions of that page’s data. The PASIS
read/write protocols guarantee that page writes are atomic. Only one transaction is allowed to commit at a time
in the current implementation.

This  policy  guarantees  that  if  transaction  N  completed,  any  transaction  numbered  lower  than  N  has 
also  completed.  Otherwise,  a  transaction  has  completed  if  all  associated  pages  have  been  written.  Whether 
a  page  is  part  of  a  partially  completed  transaction  is  determined  by  verifying  that  all  pages  involved  in  that 
transaction  were  written  by  that  transaction  or  a  subsequent  one.  Given  that  a  transaction  N  has  completed, 
only  pages  marked  with  a  higher  transaction  ID  need  be  examined.  If  a  transaction  is  incomplete,  it  can 
be  rolled  back  using  the  previous  data  included  in  the  page.  The  detection  and  rollback  process  can  be 
performed by a fsck-like tool or online.

The primary benefit of this mechanism is that it is simple. A failed metadata server can easily be
restarted on another node. Similarly, an unresponsive metadata server can be forced down and restarted on
another node. It also works when using multiple tables, though the recovering server must have access to all
tables involved in the last transaction. This can be avoided during a clean shutdown if the server performs a
final transaction on each table such that its last transaction involves no other table.
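
The detection and rollback rule can be illustrated with a minimal sketch. It is not the prototype's code: the `Page` layout, the `prev_txn_id` field, and the function names are assumptions made for illustration, and the pages are held in an in-memory dictionary rather than on storage nodes.

```python
from dataclasses import dataclass


@dataclass
class Page:
    txn_id: int           # ID of the transaction that last wrote this page
    peers: list           # IDs of the other pages written by that transaction
    new_data: bytes       # version written by the transaction
    prev_data: bytes      # version before the transaction, kept for rollback
    prev_txn_id: int = 0  # hypothetical field: lets rollback restore the old ID


def is_incomplete(pages, page_id):
    """A page is torn if some peer was not written by its transaction or a later one."""
    page = pages[page_id]
    return any(pages[peer].txn_id < page.txn_id for peer in page.peers)


def recover(pages):
    """Roll back every page that belongs to a partially written transaction."""
    torn = [pid for pid in pages if is_incomplete(pages, pid)]
    for pid in torn:
        page = pages[pid]
        page.new_data = page.prev_data   # restore the previous version
        page.txn_id = page.prev_txn_id
        page.peers = []                  # the torn transaction no longer applies
    return sorted(torn)
```

Because only one transaction commits at a time, at most the most recent transaction can be torn, so a single pass over the pages is enough in this sketch.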

4.2.1  Dynamic  redistribution 

Dynamic redistribution was added in two steps. First, metadata was partitioned into multiple B-tree tables,
each holding metadata for a set of object IDs. All distribution is done at the granularity of these tables, and
each table is managed by a single metadata server at any time. Though tables cover a fixed range of object
IDs in our prototype, we are currently implementing support for dynamic resizing of tables.

The second step was to implement the actual redistribution routines. The root metadata server can
request that a metadata server assume or relinquish responsibility for a metadata table. A metadata server
that does not respond in a timely manner is assumed faulty and is recovered as described below. Hence,
the root metadata server has the following three responsibilities. First, it keeps a delegation list describing
which metadata server manages each object. Second, it coordinates redistribution. Third, it durably stores
the metadata for all metadata tables in a meta-metadata table that it manages. This latter table is required
to bootstrap the root and other metadata servers as well as to recover from crashes. Each metadata server
queries the root, on startup, for the delegation list and metadata to access the tables it is assigned to manage.

A metadata server must execute commands issued by the root. Upon assuming management of a
table, a metadata server verifies its consistency and begins serving queries for that table. To relinquish
responsibility for a table, a metadata server finishes the queries it is executing and queues any further ones. It then
flushes its cache of the table to the storage nodes and commits a null transaction to cleanly close the table.
While management responsibilities are being transferred, there is a short period when no metadata server is
managing the table. If the client tries to issue a request, it will discover this and will re-fetch the delegation
list. Hence, queries will queue up at the client during this time. A metadata server can request more or fewer
tables, and it may request adding or dropping management responsibilities for a particular object. If the root
agrees to such requests, it will coordinate the sequence of changes in management responsibilities.

Clients are unchanged except that they must request an updated delegation list from the root if the
metadata for an object is not where expected.¹ This can happen at startup or after redistribution. Of course,
a client must issue requests to the metadata server specified in the latest delegation list that it has cached.
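
The client-side behaviour can be sketched as follows. This is a simplified illustration under stated assumptions, not the prototype's interface: the class names, the `NotManagedHere` exception, and the in-memory `move_table` helper are all hypothetical, and a real client would retry or queue rather than refresh once.

```python
class NotManagedHere(Exception):
    """Raised by a metadata server asked about a table it does not manage."""


class MetadataServer:
    def __init__(self, name):
        self.name = name
        self.tables = {}  # table_id -> {object_id: metadata}

    def lookup(self, table_id, object_id):
        if table_id not in self.tables:
            raise NotManagedHere(table_id)
        return self.tables[table_id][object_id]


class Root:
    def __init__(self):
        self.delegation = {}  # table_id -> MetadataServer currently responsible

    def delegation_list(self):
        return dict(self.delegation)

    def move_table(self, table_id, dst):
        src = self.delegation[table_id]
        dst.tables[table_id] = src.tables.pop(table_id)  # relinquish, then assume
        self.delegation[table_id] = dst


class Client:
    def __init__(self, root):
        self.root = root
        self.cached = root.delegation_list()
        self.refreshes = 0

    def lookup(self, table_id, object_id):
        try:
            return self.cached[table_id].lookup(table_id, object_id)
        except NotManagedHere:
            # Metadata is not where expected: re-fetch the delegation list and retry.
            self.cached = self.root.delegation_list()
            self.refreshes += 1
            return self.cached[table_id].lookup(table_id, object_id)
```

The client pays for a refresh only after a table has actually moved, which keeps the common path identical to the single-server case.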

Recovery is a more general variant of the centralized case. Instead of restarting a failed server on
a new host, the metadata tables from a failed metadata server are redistributed to the remaining metadata
servers. If a new host is booted for the purpose of recovery, this server is a natural candidate for managing
such orphaned tables. Our current implementation suffers from the simple failure recovery mechanism
outlined above. Specifically, all tables involved in a transaction are required to recover a table of questionable
consistency. Therefore, the consistency of tables with overlapping operations must be checked from a
single server. Fortunately, few operations require more than one table. A fsck-like tool can be used in
the background to certify a table as consistent. To mitigate this expense, the recovering metadata server can
accept all of a failed server’s tables and then redistribute its original tables to shed load.

4.2.2  Cross-server  operations 

We implemented two atomic cross-server operations, RENAME and ENUMERATE. ENUMERATE provides
an atomic view of the metadata for a subset of objects in the system. We need ENUMERATE only for
bootstrapping, so it need not be fast.

¹ Delegation list queries can be served by other metadata servers as well.

Figure 3: Our implementation of cross-server operations. The sequence of operations required to handle rename a to b is
shown. Returning to the original state is similar.


Implementing  cross-server  operations  on  top  of  dynamic  redistribution  was  simple.  Here,  we  outline 
RENAME;  ENUMERATE  is  similar.  To  move  an  object  from  a  directory  managed  by  one  metadata  server  to 
one  managed  by  another,  the  metadata  server  with  the  source  directory  borrows  management  responsibility 
for  the  destination  directory.  The  root  metadata  server  coordinates  this  redistribution,  and  the  metadata 
server  executes  a  same-server  RENAME.  When  the  operation  completes,  the  metadata  server  requests  that 
the  root  redistribute  the  destination  directory  back  to  its  original  server.  Figure  3  illustrates  the  first  half  of 
this  sequence. 
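
The borrow, rename, and return sequence can be sketched in a few lines. This is an illustrative model only: it redistributes individual directories rather than whole metadata tables, keeps everything in memory, and all class and function names are assumptions.

```python
class Server:
    def __init__(self, name):
        self.name = name
        self.dirs = {}  # directory name -> {file name: object ID}

    def rename(self, src_dir, dst_dir, name):
        # Same-server RENAME: both directories must be managed locally.
        self.dirs[dst_dir][name] = self.dirs[src_dir].pop(name)


class Root:
    def __init__(self):
        self.owner = {}  # directory name -> Server currently responsible

    def redistribute(self, directory, to):
        frm = self.owner[directory]
        to.dirs[directory] = frm.dirs.pop(directory)
        self.owner[directory] = to


def cross_server_rename(root, src_dir, dst_dir, name):
    src_server = root.owner[src_dir]
    lender = root.owner[dst_dir]
    if lender is src_server:
        src_server.rename(src_dir, dst_dir, name)  # already a same-server operation
        return
    root.redistribute(dst_dir, src_server)         # borrow the destination directory
    try:
        src_server.rename(src_dir, dst_dir, name)  # ordinary same-server RENAME
    finally:
        root.redistribute(dst_dir, lender)         # hand it back to its original server
```

The only new logic is the two `redistribute` calls; the rename itself reuses the existing same-server path, which is why so little additional code is needed.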

For  simplicity,  in  our  implementation,  the  server  performing  a  cross-server  RENAME  handles  no  other 
queries during the RENAME operation. As a result, if each metadata server manages N tables, N+1 tables are
essentially  locked  for  the  duration  of  the  RENAME.  Additionally,  the  root  metadata  server  avoids  deadlock 
problems  (e.g.,  a  case  where  one  server  holds  a  lock  for  table  A  and  requests  another  for  table  B  while 
another  holds  the  lock  for  table  B  and  wants  the  lock  for  table  A)  by  allowing  only  one  redistribution  in  the 
system  at  a  time.  Though  there  are  higher  performance  solutions  to  this  potential  deadlock  problem,  our 
simple  solution  is  sufficient;  it  does  not  degrade  performance  in  practice  due  to  the  rareness  of  cross-server 
operations. 

5  Evaluation 

We evaluated the scalability of our prototype on a cluster of twenty well-connected server nodes (the
hardware setup is in Table 2). Of these twenty machines, there were six each of metadata servers, storage nodes,
and load generators. The other two were used for the root metadata server and its storage node.

Metadata tables were distributed equally across the storage nodes, and each metadata server was
initially given management responsibility for all tables at one storage node. Each storage node used a 20 GB
partition on its disk for the metadata tables assigned to it.

Metadata  servers  used  a  640  MB  write-through  cache,  and  storage  nodes  used  a  640  MB  write-back 
cache  stored  in  non-volatile  RAM. 

Though  storage  clusters  with  hundreds  or  even  thousands  of  disks  are  becoming  more  common  [6,  9], 
our  resources  are  more  limited.  To  fully  load  our  metadata  servers  with  a  standard  filesystem  benchmark 
would  require  more  storage  nodes  than  we  have  available.  Furthermore,  it  would  not  allow  us  as  much 
control  over  our  experiments.  We  implemented  a  distributed  workload  generator  that  replays  operations 
from  the  traces  described  in  Section  5.2.  The  workload  generator  bypasses  our  prototype’s  filesystem  and 
instead  queries  the  metadata  servers  directly.  To  ensure  that  all  available  metadata  server  throughput  was 
utilized,  we  used  as  many  machines  for  the  load  generator  as  for  metadata  servers.  Operations  are  played  in 
parallel, as fast as possible, so long as the operation involves only objects that have already been successfully
created.

Servers      20 x Dell PE650
Memory       1 GB RAM
Processor    2.66 GHz Pentium 4
Disk         Seagate ST336607LW 36 GB
NIC          Intel e1000
Switch       Dell PowerConnect 5224

Table 2: Hardware setup. Metadata Server, Storage Node, and Load Generator configuration.

Figure 4: Metadata Servers vs Throughput. Throughput using the Harvard EECS workload scales linearly with respect to the
number of Metadata Servers.

Objects seen in the trace are randomly distributed among the available metadata tables. Operations
within a directory are always executed on a single server. Directory locality is lost in the trace replay, so
directories are similarly distributed randomly amongst metadata tables, creating more cross-server
operations than would truly be present. Operations that span directories (cross-directory renames) require both
directories; because directories are randomly distributed, such operations will be executed on two random
servers. So, for any N metadata servers, there is a 1/N chance that such operations will operate on a single
server.
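
As a short worked version of this estimate, assume the two directories are placed independently and uniformly at random among the N metadata servers (the independence and uniformity are assumptions of this sketch):

```latex
\[
  P(\text{single server}) \;=\; \sum_{s=1}^{N} \frac{1}{N}\cdot\frac{1}{N} \;=\; \frac{1}{N},
  \qquad
  P(\text{cross-server}) \;=\; 1 - \frac{1}{N}.
\]
\[
  N = 6:\qquad P(\text{cross-server}) \;=\; \frac{5}{6} \;\approx\; 0.83 .
\]
```

With six metadata servers, roughly five out of six cross-directory renames in the replay therefore become cross-server operations.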

5.1  Performance 

Figure 4 shows that our prototype scales linearly for an average day of the Harvard EECS workload
described below. (Operation counts for the three workloads described below are given in Table 4.) In this
sense, our simplistic implementation exhibits optimal performance. As expected, when cross-server
operations are rare, cross-server operation performance is largely unimportant.

Table 3 shows average latency for a number of individual operations. Clients get a response to metadata
update operations in 1000 µs because the metadata server must durably write the operation to a storage node.
The response time drops to 260 µs (almost the network roundtrip time of 230 µs) if the operation can
be serviced from the metadata server cache. Atomic cross-server renames exhibit a latency of about 7000 µs,
which is tolerable for interactive operations and, because cross-server renames are so rare, does not affect
overall performance.

Operation              Latency at Client
CREATE                 1.20 ms
LOOKUP                 0.26 ms
UPDATE                 1.00 ms
Same-server RENAME     1.00 ms
Cross-server RENAME    7.10 ms
Ping                   0.23 ms

Table 3: Microbenchmark. Lookup is fast. Modifying operations are slower, and cross-server rename is slowest.


The cost of cross-server rename is not entirely reflected in the individual operation latencies because
moving a table for a rename affects more than just the rename operation. The largest cost of cross-server
operations in our system is that the issuing metadata server must lock its tables while it borrows
and returns the destination table. This cost is entirely an artifact of our implementation running on top of
a naive crash-recovery mechanism and can be avoided in systems where it affects overall performance. A
second penalty is that a RENAME flushes the destination metadata server’s cached copy of a table. This
could be partially resolved by subdividing each table further, and further resolved by reloading previously
cached entries when a table is returned. Additionally, if the borrower promises to modify only one object,
the lender only needs to flush that one. If the issuing metadata server did not stall, a third cost would be
that this server would need to service queries to its allocated tables plus any queries to any tables that it
is currently borrowing. This can be limited by using more fine-grained redistribution, either by using more
tables or by dynamically creating tables. We plan to support more fine-grained redistribution by dynamically
splitting and merging tables; cross-server operation performance will improve as a result. For overall
performance, however, none of these costs is significant given the rarity of cross-server operations.

Figure 5 considers the impact of the cross-server operation frequency on throughput. The proportion
of cross-directory RENAMEs is gradually increased in a workload otherwise identical to the Harvard EECS
distribution accessing six metadata servers. Even when cross-directory renames are 0.1% of the workload,
which is orders of magnitude larger than in the traces, our unoptimized implementation achieves 80% of its
maximal throughput.

To understand this data, consider the state of each metadata server. Recall that our prototype allows
only one cross-server rename to proceed at a time, and the metadata server handling that rename handles no
other requests. Hence, for six metadata servers, when there is always an outstanding cross-server RENAME,
the expected throughput is just under 5/6, or about 80%, of the maximum. The system does not scale past this
boundary; when more than one RENAME is outstanding, the second must wait until the first completes. So, we
expect our prototype to scale well until cross-server renames conflict. Indeed, if 5 metadata servers respond
to 45,000 operations per second (from Figure 4) and a cross-server RENAME takes 0.007 seconds, then
all six metadata servers can respond to one cross-server RENAME and 315 operations in 0.007 seconds.
Hence, our prototype will not scale when cross-server renames are 1/316 ≈ 0.3% of operations (as shown in
Figure 5). If our implementation allowed concurrent redistribution and concurrent queries during RENAME,
performance would be bound only by the overhead of dynamic redistribution. We estimate that such a
system could achieve over 90% of its maximum throughput even if cross-server operations represented 2%
of the workload, and over 50% if cross-server operations were 10%. Dynamic resizing of tables would
provide further benefit.
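
The back-of-the-envelope estimate above can be reproduced directly from the two measured quantities (45,000 operations per second for five servers and 0.007 seconds per cross-server RENAME). The function names below are illustrative only.

```python
def busy_bound(servers):
    """Fraction of servers still serving requests while one handles a RENAME."""
    return (servers - 1) / servers


def saturation_fraction(ops_per_second, rename_seconds):
    """Share of operations that are cross-server RENAMEs when one is always outstanding."""
    other_ops = ops_per_second * rename_seconds  # operations served during one RENAME
    return 1.0 / (other_ops + 1.0)               # one RENAME per (other_ops + 1) operations


bound = busy_bound(6)                                  # five of six servers keep working
fraction = saturation_fraction(45_000, 0.007)          # one RENAME per 316 operations
print(f"throughput bound: {bound:.1%}, saturation at {fraction:.2%} renames")
```

Below that rename fraction, renames rarely overlap and the serialization of redistributions is not the bottleneck; above it, renames begin to queue behind one another.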

The experimental results from our prototype implementation demonstrate that scalability is linear if
cross-server operations are rare. Fortunately, cross-server operations are extremely rare in practice. The
following section quantifies just how rare these operations are.

Figure 5: Throughput vs. percent of cross-server renames. This graph shows how throughput varies as cross-directory renames
are added to a workload. (x-axis: percent of cross-server renames in workload.)


5.2  Analysis  of  renames  in  real-world  workloads 

We  analyzed  three  NFS  traces  in  order  to  determine  the  frequency  and  distribution  of  RENAME  operations  in 
real  workloads.  Our  results  show  that  while  renames  are  rare,  cross-directory  renames  are  even  rarer — 
they  comprise  a  maximum  of  only  0.006%  of  all  operations  in  the  traces.  We  also  find  that  cross-directory 
renames  arrive  in  quick  bursts,  indicating  that  it  might  be  possible  to  amortize  the  overhead  of  initiating  a 
cross-server  RENAME  over  multiple  RENAME  operations. 

5.2.1  Traces  used 

We  used  three  NFS  traces  [10]  from  Harvard  University  for  our  trace-based  evaluation.  We  describe  each 
trace  and  its  constituent  workload  below. 

EECS03: The EECS03 trace captures NFS traffic observed at a Network Appliance filer in February
2003. This filer serves home directories for the Electrical Engineering and Computer Science
Department. It sees an engineering workload of research, software development, course work, and
WWW traffic. Detailed characterization of this environment can be found in [12].

DEAS03: The DEAS03 trace captures NFS traffic observed at another Network Appliance filer in
February 2003. This filer serves the home directories of the Department of Engineering and
Applied Sciences. It sees a heterogeneous workload of research and development combined with
e-mail and a small amount of WWW traffic. The workload seen in the DEAS03 environment can be
best described as a combination of that seen in the EECS03 environment and e-mail traffic. Detailed
characterization of this environment can be found in [11] and [12].

CAMPUS: The CAMPUS trace captures a subset of the NFS traffic observed by the CAMPUS storage
system between October 15th and 28th, 2001. The CAMPUS storage system provides storage for the
e-mail, web, and computing activities of 10,000 students, staff, and faculty and comprises fourteen
53 GB storage disk arrays. The subset of activity captured in the CAMPUS trace includes only the
traffic between one of the disk arrays (home02) and the general e-mail and login servers. NFS traffic
generated by serving web pages and by students working on CS assignments is not included. However,
despite these exclusions, the CAMPUS trace contains more operations per day (on average) than
either the EECS03 or DEAS03 trace. Detailed characterization of this environment can be found
in [10] and [11].

                          EECS                     DEAS                     CAMPUS
Total operations          176,933,945              577,751,752              655,637,109
LOOKUP                    165,511,605  92.979%     567,130,767  98.162%     638,424,369  97.375%
CREATE                      1,241,034   0.758%       1,856,980   0.321%       1,917,930   0.293%
UPDATE                      9,469,619   6.207%       8,593,249   1.487%      15,288,896   2.332%
Same-directory RENAME          96,086  0.0543%         134,680   0.023%           5,666       ε
Cross-directory RENAME          2,186  0.0012%          36,076   0.006%             248       ε

Table 4: Metadata operation breakdowns for the EECS, DEAS, and CAMPUS month-long traces. The equivalent metadata
operations necessary to perform each operation in the traces are counted and shown in this table. Note that RENAMEs are very
rare. NFS GETATTR operations are not counted in this table, as they are an artifact of the NFS caching model; clients
of an object-based storage system would not issue these requests because the storage system would likely support a more advanced
caching model (e.g., leases). Note that if GETATTRs were counted, the contribution to total operations by RENAMEs would decrease
even further.

5.2.2  Trace  analysis 

Real-world workloads contain orders of magnitude fewer cross-server operations than can be supported
by our redistribution method without loss of performance. Table 4 shows the breakdown of operation
counts for all three traces. Same-directory renames account for only 0.054% of all operations (96,086
operations) in the EECS trace, 0.023% of all operations (134,680 operations) in the DEAS trace, and are almost
non-existent in the CAMPUS trace. These renames do not require a multiple-server interaction, as
they move files within the same directory (i.e., the source directory and destination directory are the same).
Cross-directory renames, which may require a multiple-server interaction, account for only 0.001% of all
operations (2,186 operations) in EECS and 0.006% of all operations (36,076 operations) in DEAS. Many of
these would not be cross-server, because they are among directories nearby in the hierarchy and likely to be
managed by the same server.
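
The cross-directory percentages can be checked directly against the counts in Table 4. The snippet below is only a sanity check of that arithmetic; the helper name is illustrative.

```python
def percent(count, total):
    """Share of `count` in `total`, expressed as a percentage."""
    return 100.0 * count / total


# Cross-directory RENAME counts and total operations, taken from Table 4.
eecs = percent(2_186, 176_933_945)
deas = percent(36_076, 577_751_752)
print(f"EECS: {eecs:.4f}%  DEAS: {deas:.4f}%")
```

Both values sit well below the rename fractions at which the prototype's throughput starts to degrade.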

Figure 6 shows that cross-directory renames in real-world workloads are bursty. Specifically, in the
EECS  trace,  there  is  a  70%  chance  that  a  second  RENAME  will  be  seen  within  one  second  of  the  first;  this 
probability  is  93%  in  the  DEAS  trace.  Such  burstiness  suggests  that  the  overhead  of  dynamic  distribution  can 
be  amortized  across  multiple  cross-server  renames  as  long  as  all  of  them  involve  the  same  two  metadata 
servers.  Cross-directory  renames  are  not  as  bursty  in  the  CAMPUS  trace  as  in  the  EECS  and  DEAS  traces, 
but  the  overhead  of  dynamic  distribution  in  a  workload  resembling  CAMPUS  is  not  likely  to  be  significant 
as  renames  are  extremely  rare  in  this  trace. 
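
One way to quantify this burstiness from a trace is to measure the share of cross-directory renames that follow the previous one within a given window. The sketch below assumes the rename timestamps (in seconds) have already been extracted from the trace; the function name and the sample data are illustrative, not taken from the traces.

```python
def fraction_within(timestamps, window):
    """Fraction of inter-arrival gaps between consecutive renames that are <= window."""
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    if not gaps:
        return 0.0
    return sum(gap <= window for gap in gaps) / len(gaps)


# Made-up arrival times: two short bursts separated by a long quiet period.
sample = [0.0, 0.2, 0.5, 10.0, 10.3]
print(fraction_within(sample, window=1.0))
```

A high value for a one-second window indicates that a borrowed table could often serve several renames before being returned.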

6  Future  work 

We plan to improve this work in several ways. We plan to evaluate the benefits of multiple concurrent
redistributions, dynamic resizing of tables, and concurrent queries during cross-server operations. Because
cross-server operations are rare, these features are not needed for average-case performance but rather
to broaden the scope of applicability: we would like to demonstrate, in a graph similar to Figure 5, that this
approach works well when cross-server operations are 1% or more of the load. We plan to implement
and measure other cross-server operations, such as hourly snapshot, and we would like to compare directly
the performance and complexity of two-phase commit and other mechanisms. We plan to demonstrate
how to efficiently and expeditiously revoke the authority of an errant metadata server. Though this work
is orthogonal to distributed workload redistribution, we would like to distribute the root metadata server’s
functions. Ideally, we would like to demonstrate the simplicity of our implementation on a third-party
object-based storage system or similar distributed system. We would also like to evaluate trace data from an
environment where many clients access many objects in parallel.

Figure 6: These graphs show the percentage of cross-directory rename operations vs. maximum inter-arrival time. Most
cross-directory RENAMEs occur in bursts: in the EECS and DEAS traces, 70% and 93% of all cross-directory RENAMEs occur
within one second of each other. The inter-arrival times of cross-directory renames are larger in the CAMPUS trace, but renames are
also much less frequent in this trace than in the EECS and DEAS traces.

7  Summary 

Cross-server operations are a long-standing source of headaches for implementers of scalable distributed
file systems. Dynamic redistribution can be used to eliminate cross-server operations, eliminating the need
for protocols dedicated to handling them. Although it is much more heavy-weight than a dedicated
cross-server update protocol, the use of dynamic redistribution can simplify the system with minimal impact on
performance. Experiments with our prototype implementation show that overall performance is not affected
when fewer than 0.01% of operations are (initially) cross-server. Analysis of NFS traces shows that
cross-server operations are much rarer than that: fewer than 0.006% of client requests could possibly be
cross-server in the traces analyzed. We believe that this approach to handling infrequent cross-server operations
is very promising for distributed file systems and, perhaps, for other scalable distributed systems as well.


Acknowledgements 

We  would  like  to  thank  Craig  Soules  for  his  comments  and  feedback.  We  also  thank  Dan  Ellard  and  Margo 
Seltzer  for  sharing  the  NFS  traces. 




References 


[1] Michael Abd-El-Malek, William V. Courtright II, Chuck Cranor, Gregory R. Ganger, James
Hendricks, Andrew J. Klosterman, Michael Mesnier, Manish Prasad, Brandon Salmon, Raja R.
Sambasivan, Shafeeq Sinnamohideen, John D. Strunk, Eno Thereska, Matthew Wachs, and Jay J. Wylie. Ursa
Minor: versatile cluster-based storage. Conference on File and Storage Technologies (San Francisco,
CA, 13-16 December 2005), pages 59-72. USENIX Association, 2005.

[2] Atul Adya, William J. Bolosky, Miguel Castro, Gerald Cermak, Ronnie Chaiken, John R. Douceur, Jon
Howell, Jacob R. Lorch, Marvin Theimer, and Roger P. Wattenhofer. FARSITE: federated, available,
and reliable storage for an incompletely trusted environment. Symposium on Operating Systems Design
and Implementation (Boston, MA, 09-11 December 2002), pages 1-15. USENIX Association, 2002.

[3] Thomas E. Anderson, Michael D. Dahlin, Jeanna M. Neefe, David A. Patterson, Drew S. Roselli,
and Randolph Y. Wang. Serverless network file systems. ACM Symposium on Operating System
Principles (Copper Mountain Resort, CO, 3-6 December 1995). Published as Operating Systems
Review, 29(5):109-126, 1995.

[4] Miguel Castro and Barbara Liskov. Practical Byzantine fault tolerance. Symposium on Operating
Systems Design and Implementation (New Orleans, LA, 22-25 February 1999), pages 173-186. ACM,
1998.

[5]  Ann  E.  Chervenak,  Vivekanand  Vellanki,  and  Zachary  Kurmas.  Protecting  file  systems:  a  survey  of 
backup  techniques.  Joint  NASA  and  IEEE  Mass  Storage  Conference  (March  1998),  1998. 

[6] Don Clark. Los Alamos Lab Picks Panasas for Data Storage. Wall Street Journal Online, 20 Oct 2003.

[7] Clustered MDS, Apr 2006. https://mail.clusterfs.com/wikis/lustre/DesignChanges.

[8] The Lustre file system roadmap, May 2006. http://www.clusterfs.com/roadmap.html.

[9] Selecting a Scalable Cluster File System, Nov 2005. Cluster File Systems, Inc.

[10] Daniel Ellard, Jonathan Ledlie, Pia Malkani, and Margo Seltzer. Passive NFS tracing of email and
research workloads. Conference on File and Storage Technologies (San Francisco, CA, 31 March-2
April 2003), pages 203-216. USENIX Association, 2003.

[11] Daniel Ellard, Jonathan Ledlie, and Margo Seltzer. The utility of file names. Technical report
TR-05-03. Harvard University, March 2003.

[12] Daniel Ellard and Margo Seltzer. New NFS tracing tools and techniques for system analysis. Systems
Administration Conference (San Diego, CA), pages 73-85. USENIX Association, 26-31 October 2003.

[13] Garth A. Gibson, David F. Nagle, Khalil Amiri, Jeff Butler, Fay W. Chang, Howard Gobioff, Charles
Hardin, Erik Riedel, David Rochberg, and Jim Zelenka. A cost-effective, high-bandwidth storage
architecture. Architectural Support for Programming Languages and Operating Systems (San Jose,
CA, 3-7 October 1998). Published as SIGPLAN Notices, 33(11):92-103, November 1998.

[14] Garth A. Gibson, David F. Nagle, Khalil Amiri, Fay W. Chang, Eugene M. Feinberg, Howard Gobioff,
Chen Lee, Berend Ozceri, Erik Riedel, David Rochberg, and Jim Zelenka. File server scaling with
network-attached secure disks. ACM SIGMETRICS Conference on Measurement and Modeling of
Computer Systems (Seattle, WA, 15-18 June 1997). Published as Performance Evaluation Review,
25(1):272-284. ACM, June 1997.



[15]  David  Hitz,  James  Lau,  and  Michael  Malcolm.  File  system  design  for  an  NFS  file  server  appliance. 
Winter  USENIX  Technical  Conference  (San  Francisco,  CA,  17-21  January  1994),  pages  235-246. 
USENIX  Association,  1994. 

[16]  John  H.  Howard,  Michael  L.  Kazar,  Sherri  G.  Menees,  David  A.  Nichols,  M.  Satyanarayanan, 
Robert  N.  Sidebotham,  and  Michael  J.  West.  Scale  and  performance  in  a  distributed  file  system. 
ACM  Transactions  on  Computer  Systems  (TOCS),  6(1):51-81.  ACM,  February  1988. 

[17] Edward K. Lee and Chandramohan A. Thekkath. Petal: distributed virtual disks. Architectural Support
for Programming Languages and Operating Systems (Cambridge, MA, 1-5 October 1996). Published
as SIGPLAN Notices, 31(9):84-92, 1996.

[18] Lustre, Apr 2006. http://www.lustre.org/.

[19] Marshall Kirk McKusick. Running ’fsck’ in the background. BSDCon Conference (San Francisco,
CA, 11-14 February 2002), 2002.

[20]  Athicha  Muthitacharoen,  Robert  Morris,  Thomer  M.  Gil,  and  Benjie  Chen.  Ivy:  a  read/write  peer-to- 
peer  file  system.  Symposium  on  Operating  Systems  Design  and  Implementation  (Boston,  MA,  09-11 
December  2002).  USENIX  Association,  2002. 

[21] When to Use Transactional NTFS, Apr 2006.
http://msdn.microsoft.com/library/en-us/fileio/fs/when_to_use_transactional_ntfs.asp.

[22] Hugo Patterson, Stephen Manley, Mike Federwisch, Dave Hitz, Steve Kleiman, and Shane Owara.
SnapMirror:  file  system  based  asynchronous  mirroring  for  disaster  recovery.  Conference  on  File  and 
Storage  Technologies  (Monterey,  CA,  28-30  January  2002),  pages  117-129.  USENIX  Association, 
2002. 

[23]  Reiser4  Transaction  Design  Document,  Apr  2006.  http://www.namesys.com/txn-doc.html/. 

[24] Yasushi Saito and Christos Karamanolis. Name space consistency in the Pangaea wide-area file system.
HP Laboratories SSP Technical Report HPL-SSP-2002-12. HP Labs, December 2002.

[25] Frank Schmuck and Roger Haskin. GPFS: a shared-disk file system for large computing clusters.
Conference  on  File  and  Storage  Technologies  (Monterey,  CA,  28-30  January  2002),  pages  231-244. 
USENIX  Association,  2002. 

[26] Fred B. Schneider. Implementing fault-tolerant services using the state machine approach: a tutorial.
ACM  Computing  Surveys,  22(4):299-319,  December  1990. 

[27]  Douglas  B.  Terry,  Marvin  M.  Theimer,  Karin  Petersen,  Alan  J.  Demers,  Mike  J.  Spreitzer,  and  Carl  H. 
Hauser.  Managing  update  conflicts  in  Bayou,  a  weakly  connected  replicated  storage  system.  ACM 
Symposium  on  Operating  System  Principles  (Copper  Mountain  Resort,  CO,  3-6  December  1995). 
Published  as  Operating  Systems  Review,  29(5),  1995. 

[28] Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee. Frangipani: a scalable distributed
file system. ACM Symposium on Operating System Principles (Saint-Malo, France, 5-8 October 1997).
Published as Operating Systems Review, 31(5):224-237. ACM, 1997.

[29] R. O. Weber. Information technology - SCSI Object-Based Storage Device Commands (OSD).
Technical Council Proposal Document T10/1355-D, ANSI Technical Committee T10, July 2004.




[30]  Jay  J.  Wylie,  Michael  W.  Bigrigg,  John  D.  Strunk,  Gregory  R.  Ganger,  Han  Kiliccote,  and  Pradeep  K. 
Khosla.  Survivable  information  storage  systems.  IEEE  Computer,  33(8):61-68.  IEEE,  August  2000. 




